Phase 1: Data Preparation & Visualisation¶

Group Name: 21¶

Steven Garrett - s3910501¶

Table of Contents

  • Introduction
  • Dataset Information and Source
  • Dataset Features
  • Target Feature
  • Goals and Objectives
  • Data Cleaning and Preprocessing
  • Data Exploration and Visualisation
  • Summary and Conclusion
  • References

Introduction ¶

This report is the first of two phases in a project whose objective is to predict the target features of Envifonmental, Social, and Govnernance Rankings as provided by the investment researcher Sustainalytics.

All data are freely available online through Yahoo Finance and with the assistance of scripting in RStudio we are able to merge, wrangle, sample, and visualise the data whose features will be utilised in the calibration of the prediction model.

Six visualisations in total are presented across a scope of three, two, and one variable. Each variable set contains two plots. This is not intended to be an exhaustive list.

Dataset Information and Source¶

There are three sources of data which have been combined into a single dataframe by Stock Symbol (primary key). The sources are outlined as follows;

Company Specfic Data¶
  • The SP500 Index data can be located via YF and also at the S&P GLobal Site.
  • The data cover 503 companies within the Standard & Poor 500 Index (SP500) as at January 2023.
  • This data is refreshed periodically, and evolves to encapsulate current market moevements and trends.
  • The index breaks down into 117 Industries within 11 Sectors.
  • Being mindful of small sample sizes and the Central Limit Theorum we will initially aggregate at a Sector level for cleaning and visualisation.
Non-financial (ESG) Measurement Data¶
  • The SP500 data were extracted utilising RStudio scripting by Kyle Rudden.
  • The data we are analysing for this purpose has been made available by Sustainalytics (Sust) via the Yahoo Finance website (YFin).
  • Data collected primarily as at August 2022 with a split minority across - March, May and Septemeber of 2022.
  • Data are organised to represent (Env), (Soc), (Gov), (Con), and (Tot) Risk ratings on a per-constituent basis across the index.
  • As each relevant company has only one measurement in the sample we note there to be no time-series data present - the data presented is as a static snapshot only.
Financial Measurement Data¶
  • Data covers the period from Dec 1 2019 through Dec 31 2022 and is sourced from YF via yfinance.
  • Financial Data breaks down into two categories: Investment Performance and Company Measurement.
  • We have sampled 3 years of data optimistically, with an expectation that a minority of companies in the index may be dropped due to insufficient data.
  • Financial measurements are calculated on 1 and 3-year windows.
  • Investment Performance Data;
    • Metrics include;
      • Historic Performance: What return on ivestment over which period?
      • Historic Risk: How much volatility over which period?
  • Company Measurement Data
    • Metrics include;
      • Market Capitalisation: How large is the compay as a measurement of its outstanding shares by current share price?,
      • Dividend Yield: What is the return to investor via dividend of the individual security proportional to its price?,
      • Earnings Per Share: What is the proportionality of company earnings per oustanding share,
      • Beta: Where over 1.0 is relatively more risk, and under 1.0 is relatively lower risk, what is the 'risk' of the stock price?.

Retrieval¶

  • The data from Sustainalityics (Sust) have been scraped from Yahoo Finance (YF) and are presently stored in a csv.
  • The financial data has been accessed via the yfinance package sourcing YF.

Data Cleaning and Processing¶

Overview¶

  • The Index exists as an aggregate sum-of-parts model where the only expectation we hold is of being an ordinal list holding the biggest ~500 companies.
  • We note that:
    • This list as presented affords all companies an initial equal 'weighting' by either being a part of the index (or not), and
    • Data for the companies financial and non-financial risk (ESG) measurements over the period is complete (or not).
  • There are 501 Index constituents at inception. This drops 409 after removing those with missing ESG Data or insufficient investment performance History.
  • We note this to be a possible source of survivorship bias and will explore this differential in the Data Eploration stage.
  • We observe the variables as given and operate the following data hygiene steps:
    • String analysis and manipulation (where required) to remove any erronous data,
    • Summary statistics are conducted to describe the quantiative values in search of any obvious outliers or human error.

Variable Specific Notes¶

  • We note that there are variables to be dropped which denote the data collection activity itself. They are unrelated to the desired statistical interrogation ahead are ScrapDate & Link.
  • We note that primary key values pertaining to Company Name (dropped) and Symbol (set to index) are dropped from analysis. The Symbols are used for indexing and joining of dataframes.
  • We note no obivous errors in the quantitative variables and note that there will most likely be outliers once variables are standardised.
  • We note no obvious errors in the catagorical variables. This is unsurpsiging as the data has been professionally reviewed prior to publication to limit errors and omitions.
In [ ]:
# Import Packages for Wrangling
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

import io
import requests

pd.set_option('display.max_columns', None) # so that we can see all the columns

# Read in the ESG data
df_source = "C:/Users/sbgrrtt-pc/Google Drive/__RMIT/23S1/MATH2319 - Machine Learning NEW/ML Practical/A_Phase 1/23.01.19 - ESG Data SP500 Sustainalytics.csv"

esgData = pd.read_csv(df_source,index_col='Index')

print(esgData.shape) # Define the shape of the data.

print(esgData.dtypes) # Observe all feature types as is.

# We note that ScrapDate and LastUpdateDate are listed as objects and not dates.
# This will be appropriate for our scope as the data in question is a snapshot at a point in time and not a time-series.
# We move to drop the variables we will not require at all.

esgData = esgData.drop(columns=["ScrapDate"])
esgData = esgData.drop(columns=["Link"])

SP500constituents = esgData['Symbol'].to_list() # Create list of PK items for scraping

esgData['LastUpdateDate'].unique() # Review Last Update Dates

import yfinance as yf # Importing for price data

# Define the start and end dates for the 5-year period
start_date = '2019-12-31'
end_date = '2022-12-31'


# Find adjusted close prices for each Symbol over the specified period
priceData = yf.download(SP500constituents, start=start_date, end=end_date)['Adj Close'] 

priceData.to_csv('SP500priceData.csv') # Save the price data to CSV

priceData = priceData.dropna(axis=1) # Drop variables without complete priceData available

priceData.head(25) # Review DF

# Calculate log returns over 1, and 3 years
log_returns_1d = np.log(priceData).diff().dropna()
log_returns_1d.shape


tdays = 252 # Estimate Trading Days per annum
totalDays = len(log_returns_1d) # Number of last trading Day

# Subset the data into 1 and 3 year chunks. Cumulative return ~ sum over time.
lnRet1yData = log_returns_1d.iloc[totalDays-(1*tdays):totalDays]
lnRet3yData = log_returns_1d.iloc[totalDays-(3*tdays):totalDays]

# Calc Standard Deviation of daily log returns per grouping
log_returns_1y = lnRet1yData.sum()
log_returns_3y = lnRet3yData.sum()

# Calculate standard deviations over 1, and 3 years
stdev_1y = lnRet1yData.std()
stdev_3y = lnRet3yData.std()

# Create a DataFrame to store the results
invPerf = pd.DataFrame({'StDev_1y': stdev_1y, 'StDev_3y': stdev_3y,'LnRet_1y':log_returns_1y,'LnRet_3y' : log_returns_3y})

SP500complete = invPerf.index.tolist() # refresh

# Create an empty DataFrame to store the data
compPerf = pd.DataFrame(columns=['Symbol', 'MarketCap', 'Beta', 'EPS'])

# Loop through each ticker and extract the necessary data
for ticker in SP500complete:
    # Fetch the stock data for the ticker
    stock_data = yf.Ticker(ticker)
    
    # Extract the market capitalisation, beta, and earnings per share
    market_cap = stock_data.info.get('marketCap')
    beta = stock_data.info.get('beta')
    eps = stock_data.info.get('trailingEps')
    
    # Create a dictionary to store the data
    compDict = {'Symbol': ticker,'MarketCap': market_cap,'Beta': beta,'EPS': eps}
    
    # Append the dictionary to the DataFrame
    compPerf = compPerf.append(compDict, ignore_index=True)

# set Symbol as index for joining to workingData
esgData.set_index('Symbol',inplace = True)
compPerf.set_index('Symbol', inplace=True)

# Save scraped data to CSV as checkpoint
compPerf.to_csv('compPerf.csv')
invPerf.to_csv('invPerf.csv')
(501, 12)
Symbol                   object
Company                  object
Sector                   object
Industry                 object
Link                     object
ScrapDate                object
TotalESGScore           float64
EnvironmentRiskScore    float64
SocialRiskScore         float64
GovernanceRiskScore     float64
ControversyLevel        float64
LastUpdateDate           object
dtype: object
[*********************100%***********************]  501 of 501 completed

2 Failed downloads:
- SIVB: No data found, symbol may be delisted
- FRC: Data doesn't exist for startDate = 1577768400, endDate = 1672462800
Company                  0
Sector                   0
Industry                 0
TotalESGScore           62
EnvironmentRiskScore    62
SocialRiskScore         62
GovernanceRiskScore     62
ControversyLevel        62
LastUpdateDate          62
MarketCap                8
Beta                    28
EPS                     10
StDev_1y                 8
StDev_3y                 8
LnRet_1y                 8
LnRet_3y                 8
dtype: int64
Company                 0
Sector                  0
Industry                0
TotalESGScore           0
EnvironmentRiskScore    0
SocialRiskScore         0
GovernanceRiskScore     0
ControversyLevel        0
LastUpdateDate          0
MarketCap               0
Beta                    0
EPS                     0
StDev_1y                0
StDev_3y                0
LnRet_1y                0
LnRet_3y                0
dtype: int64
In [ ]:
# Join dataframes (esgData, compPerf, invPerf) to combine for one
workingData = esgData.join(compPerf, how = 'left')
workingData = workingData.join(invPerf,how = 'left')
In [ ]:
print(workingData.isnull().sum()) # review whats missing at a high level
len(workingData) # 501 rows in index

workingData = workingData.dropna() # reduces from 503 to ~ 421 entries of complete data, ~15% missing

workingData = workingData.drop(columns='Company') # Remove individual names
# workingData = workingData.reset_index(drop=True, inplace=True) # remove ticker from index

print(workingData.isnull().sum()) # review whats missing at a high level again

sampleSize = len(workingData)

print("Sample size for the tuples with sufficient data is "+str(sampleSize))
Company                  0
Sector                   0
Industry                 0
TotalESGScore           62
EnvironmentRiskScore    62
SocialRiskScore         62
GovernanceRiskScore     62
ControversyLevel        62
LastUpdateDate          62
MarketCap                8
Beta                    28
EPS                     10
StDev_1y                 8
StDev_3y                 8
LnRet_1y                 8
LnRet_3y                 8
dtype: int64
Sector                  0
Industry                0
TotalESGScore           0
EnvironmentRiskScore    0
SocialRiskScore         0
GovernanceRiskScore     0
ControversyLevel        0
LastUpdateDate          0
MarketCap               0
Beta                    0
EPS                     0
StDev_1y                0
StDev_3y                0
LnRet_1y                0
LnRet_3y                0
dtype: int64
Sample size for the tuples with sufficient data is 421
In [ ]:
# Fix up data types
workingData['MarketCap'] = workingData['MarketCap'].astype(float)
workingData['Beta'] = workingData['Beta'].astype(float)
workingData['EPS'] = workingData['EPS'].astype(float)
workingData['ControversyLevel'] = workingData['ControversyLevel'].astype(str)

# SAVE TO CSV AS A CHECKPOINT
workingData.to_csv('workingDataCleaned.csv')
In [ ]:
workingData.sample(25) # Review sample of DF
Out[ ]:
Sector Industry TotalESGScore EnvironmentRiskScore SocialRiskScore GovernanceRiskScore ControversyLevel LastUpdateDate MarketCap Beta EPS StDev_1y StDev_3y LnRet_1y LnRet_3y
Symbol
CRM Information Technology Application Software 13.0 2.3 6.4 4.5 2.0 Aug-22 2.059473e+11 1.243842 0.21 0.029686 0.027596 -0.655295 -0.204278
VMC Materials Construction Materials 29.0 15.6 4.6 8.9 2.0 Aug-22 2.624190e+10 0.743138 4.67 0.019513 0.024519 -0.157319 0.223908
GWW Industrials Industrial Machinery 15.0 3.4 5.5 5.7 1.0 Aug-22 3.420286e+10 1.191766 32.59 0.017887 0.020330 0.086911 0.542710
DE Industrials Agricultural & Farm Machinery 17.0 3.2 6.9 6.6 2.0 Aug-22 1.077279e+11 1.037297 26.39 0.022361 0.023664 0.238272 0.946467
FIS Information Technology Data Processing & Outsourced Services 18.0 1.3 11.1 5.7 1.0 Aug-22 3.367412e+10 0.832160 -27.56 0.030482 0.026053 -0.461202 -0.673980
TT Industrials Building Products 16.0 7.1 3.8 5.2 2.0 Aug-22 3.951685e+10 1.180154 7.80 0.019284 0.021483 -0.160537 0.537728
CDNS Information Technology Application Software 12.0 1.4 6.1 4.8 0.0 Aug-22 5.909335e+10 1.068261 3.18 0.025643 0.025444 -0.158605 0.839855
BDX Health Care Health Care Equipment 27.0 3.7 15.6 7.5 3.0 Aug-22 7.153770e+10 0.564743 5.44 0.015262 0.017328 0.042520 -0.001725
VFC Consumer Discretionary Apparel, Accessories & Luxury Goods 13.0 0.9 6.8 5.1 2.0 Aug-22 7.411689e+09 1.484166 1.07 0.027905 0.028636 -0.925404 -1.182015
C Financials Diversified Banks 27.0 1.8 13.1 11.8 4.0 Aug-22 8.898594e+10 1.563335 7.28 0.020801 0.028997 -0.250313 -0.459838
UNP Industrials Railroads 25.0 9.4 12.1 3.7 2.0 Aug-22 1.213110e+11 1.104855 11.30 0.016526 0.019575 -0.164033 0.198410
NTRS Financials Asset Management & Custody Banks 24.0 1.9 11.8 10.7 2.0 Aug-22 1.504229e+10 1.084095 5.78 0.021734 0.024666 -0.270360 -0.094560
JNJ Health Care Pharmaceuticals 25.0 0.7 15.8 8.6 4.0 Aug-22 4.129642e+11 0.534482 4.76 0.010927 0.013715 0.050809 0.269101
ALB Materials Specialty Chemicals 29.0 15.5 5.5 7.7 2.0 Aug-22 2.395414e+10 1.535890 30.32 0.035057 0.034295 -0.071055 1.119612
GE Industrials Industrial Conglomerates 41.0 14.1 15.0 11.6 3.0 Aug-22 1.136726e+11 1.205137 7.27 0.022120 0.028856 -0.118103 -0.050954
IP Materials Paper Packaging 24.0 14.3 4.5 5.1 3.0 Aug-22 1.109194e+10 1.021463 4.52 0.018580 0.023410 -0.255541 -0.098055
O Real Estate Retail REITs 14.0 4.0 5.5 4.8 0.0 Aug-22 4.037313e+10 0.802955 1.42 0.014009 0.024415 -0.077590 0.014969
DAL Industrials Airlines 26.0 9.5 11.5 5.2 2.0 Aug-22 2.306709e+10 1.249824 2.94 0.029058 0.036677 -0.172331 -0.569542
WHR Consumer Discretionary Household Appliances 16.0 4.8 7.1 4.0 2.0 Aug-22 7.196830e+09 1.489795 -35.22 0.024665 0.028517 -0.455638 0.057785
LLY Health Care Pharmaceuticals 32.0 3.4 17.4 11.6 2.0 Aug-22 4.199394e+11 0.366995 6.33 0.017065 0.021377 0.290851 1.072193
CCI Real Estate Specialized REITs 13.0 5.2 2.9 5.1 0.0 Aug-22 4.903952e+10 0.661857 3.83 0.019230 0.020459 -0.390170 0.050614
CMI Industrials Industrial Machinery 19.0 5.8 7.5 6.1 2.0 Aug-22 3.067365e+10 0.999296 17.75 0.017708 0.020567 0.133158 0.381079
ATVI Communication Services Interactive Home Entertainment 19.0 0.2 12.5 6.3 4.0 Aug-22 6.178423e+10 0.416256 2.35 0.017374 0.020824 0.131891 0.270625
KMX Consumer Discretionary Automotive Retail 12.0 0.1 7.9 4.0 1.0 Aug-22 1.134317e+10 1.425756 2.94 0.033689 0.031607 -0.744445 -0.364511
WY Real Estate Specialized REITs 18.0 10.5 3.3 3.8 2.0 Aug-22 2.166864e+10 1.491907 1.71 0.019511 0.029580 -0.223569 0.132287
In [ ]:
# Sector breakdown
print( len( (workingData[ 'Sector' ].unique() ) ) )  # HOW MANY SECTORS ARE THERE? 11
print(workingData['Sector' ].unique()) # WHAT ARE THE SECTORS?.
11
['Health Care' 'Industrials' 'Consumer Discretionary'
 'Information Technology' 'Financials' 'Consumer Staples' 'Utilities'
 'Materials' 'Real Estate' 'Communication Services' 'Energy']
In [ ]:
# Industry breakdown
print( len( (workingData[ 'Industry' ].unique() ) ) ) # 118 INDUSTRY clasifiers?
print( workingData[ 'Industry' ].unique() ) # LABELS OF INDUSTRY
118
['Health Care Equipment' 'Airlines' 'Automotive Retail'
 'Technology Hardware, Storage & Peripherals' 'Pharmaceuticals'
 'Health Care Distributors' 'Reinsurance' 'IT Consulting & Other Services'
 'Application Software' 'Semiconductors' 'Agricultural Products'
 'Data Processing & Outsourced Services' 'Multi-Utilities'
 'Independent Power Producers & Energy Traders' 'Life & Health Insurance'
 'Property & Casualty Insurance' 'Multi-line Insurance'
 'Insurance Brokers' 'Internet Services & Infrastructure'
 'Specialty Chemicals' 'Building Products'
 'Electrical Components & Equipment' 'Asset Management & Custody Banks'
 'Specialized REITs' 'Internet & Direct Marketing Retail'
 'Communications Equipment' 'Industrial Gases' 'Electronic Components'
 'Auto Parts & Equipment' 'Office REITs' 'Gas Utilities'
 'Interactive Home Entertainment' 'Residential REITs' 'Paper Packaging'
 'Water Utilities' 'Consumer Finance' 'Specialty Stores'
 'Aerospace & Defense' 'Diversified Banks' 'Metal & Glass Containers'
 'Computer & Electronics Retail' 'Distillers & Vintners' 'Biotechnology'
 'Life Sciences Tools & Services' 'Multi-Sector Holdings'
 'Packaged Foods & Meats' 'Financial Exchanges & Data'
 'Real Estate Services' 'Hotels, Resorts & Cruise Lines'
 'Technology Distributors' 'Fertilizers & Agricultural Chemicals'
 'Regional Banks' 'Household Products' 'Air Freight & Logistics'
 'Cable & Satellite' 'Managed Health Care' 'Restaurants'
 'Industrial Machinery' 'Health Care Supplies'
 'Oil & Gas Exploration & Production' 'Hypermarkets & Super Centers'
 'Diversified Support Services' 'Research & Consulting Services'
 'Railroads' 'Health Care Services' 'Integrated Oil & Gas'
 'Electric Utilities' 'Agricultural & Farm Machinery'
 'General Merchandise Stores' 'Homebuilding' 'Movies & Entertainment'
 'Health Care Facilities' 'Personal Products' 'Diversified Chemicals'
 'Automobile Manufacturers' 'Copper' 'Retail REITs' 'Systems Software'
 'Industrial Conglomerates' 'Interactive Media & Services'
 'Consumer Electronics' 'Investment Banking & Brokerage'
 'Oil & Gas Equipment & Services' 'Leisure Products'
 'Hotel & Resort REITs' 'Advertising' 'Construction & Engineering'
 'Trucking' 'Semiconductor Equipment' 'Oil & Gas Storage & Transportation'
 'Soft Drinks' 'Food Retail' 'Distributors' 'Home Improvement Retail'
 'Alternative Carriers' 'Casinos & Gaming' 'Home Furnishings'
 'Construction Materials' 'Tobacco' 'Oil & Gas Refining & Marketing'
 'Gold' 'Apparel, Accessories & Luxury Goods' 'Steel'
 'Housewares & Specialties' 'Construction Machinery & Heavy Trucks'
 'Health Care REITs' 'Industrial REITs'
 'Human Resource & Employment Services'
 'Environmental & Facilities Services'
 'Electronic Equipment & Instruments' 'Food Distributors'
 'Integrated Telecommunication Services' 'Brewers'
 'Electronic Manufacturing Services' 'Wireless Telecommunication Services'
 'Trading Companies & Distributors' 'Drug Retail' 'Household Appliances']
In [ ]:
print(workingData.dtypes) # Review data type for each variable
Sector                   object
Industry                 object
TotalESGScore           float64
EnvironmentRiskScore    float64
SocialRiskScore         float64
GovernanceRiskScore     float64
ControversyLevel         object
LastUpdateDate           object
MarketCap               float64
Beta                    float64
EPS                     float64
StDev_1y                float64
StDev_3y                float64
LnRet_1y                float64
LnRet_3y                float64
LogMarketCap            float64
dtype: object
In [ ]:
print(workingData.isnull().sum()) # Review missing columns
Sector                  0
Industry                0
TotalESGScore           0
EnvironmentRiskScore    0
SocialRiskScore         0
GovernanceRiskScore     0
ControversyLevel        0
LastUpdateDate          0
MarketCap               0
Beta                    0
EPS                     0
StDev_1y                0
StDev_3y                0
LnRet_1y                0
LnRet_3y                0
LogMarketCap            0
dtype: int64
Existing Features¶

The dataset contains a mix of variable types covering financial and non-financial measurements as outlined in the table below.

In [ ]:
from tabulate import tabulate

table =[['Variable','Data Split','Data Type','Description'],
['Symbol','Qualitative','Nominal','Unique Stock Identifier'],
['Company','Qualitative','Nominal','Stock Name'],
['Sector','Qualitative','Categorical','The highest grouping in the GICS* market segmentation methodology.'],
['Industry','Qualitative','Categorical','The second highest grouping in the GICS* market segmentation methodology.'],
['TotalESGScore','Quantitative','Derived','Sum of Environmental, Social, Government Scores. **'],
['EnvironmentRiskScore','Quantitative','Discrete','Measurement regarding the Environmental Risks each company is specifically exposed to.'],
['SocialRiskScore','Quantitative','Discrete','Measurement regarding the Social Risks each company is specifically exposed to.'],
['GovernanceRiskScore','Quantitative','Discrete','Measurement regarding the Governance Risks each company is specifically exposed to.'],
['ControversyLevel','Qualitative','Categorical','Measurement regarding the Controversy each company is specifically exposed to. \
\n Scores range from 0 (negligible) to 5 (severe)'],
['LastUpdateDate','Quantitative','Date','Date Score Provided to Company'],
['MarketCap','Quantitative','Discrete','The value of outstanding common shares owned by stockholders of a publicly traded company \
\n Market capitalization is derived as a product of the market price per common share and the number of common shares outstanding.'],
['Beta','Quantitative','Discrete','A measurement that shows how value moves with regards to a relative measure for an individual asset,\
commonly an index the stock exists within. Higher than 1 indicates a higher relative movement for the asset compared to its benchmark, lower than 1 indicates the opposite.'],
['EPS','Quantitative','Discrete','Earnings per share (EPS) indicates the overall profitability of a company on a per share basis. \
\n The measurement is derived by dividing the profit over the outstanding common stock shares for a company.\
\n The profit may be adjusted to account for less regular transactions that would otherwise skew the figure and not commonly represent the businesses primary activity.\
\n A relatively high EPS figure indicates a relatively more profitable company when contrasting against either peers at the same point in time or ones prior results.'],
['StDev_1y','Quantitative','Discrete','Standard Deviation of daily log Return for 1y'],
['StDev_3y','Quantitative','Discrete','Standard Deviation of daily log Return for 3y'],
['LnRet_1y','Quantitative','Discrete','Sum of daily log returns for 1y'],
['LnRet_3y','Quantitative','Discrete','Sum of daily log returns for 3y']]


print(tabulate(table, headers='firstrow', tablefmt='fancy_grid'))
╒══════════════════════╤══════════════╤═════════════╤═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Variable             │ Data Split   │ Data Type   │ Description                                                                                                                                                                                                                                                                     │
╞══════════════════════╪══════════════╪═════════════╪═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Symbol               │ Qualitative  │ Nominal     │ Unique Stock Identifier                                                                                                                                                                                                                                                         │
├──────────────────────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Company              │ Qualitative  │ Nominal     │ Stock Name                                                                                                                                                                                                                                                                      │
├──────────────────────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Sector               │ Qualitative  │ Categorical │ The highest grouping in the GICS* market segmentation methodology.                                                                                                                                                                                                              │
├──────────────────────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Industry             │ Qualitative  │ Categorical │ The second highest grouping in the GICS* market segmentation methodology.                                                                                                                                                                                                       │
├──────────────────────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ TotalESGScore        │ Quantitative │ Derived     │ Sum of Environmental, Social, Government Scores. **                                                                                                                                                                                                                             │
├──────────────────────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ EnvironmentRiskScore │ Quantitative │ Discrete    │ Measurement regarding the Environmental Risks each company is specifically exposed to.                                                                                                                                                                                          │
├──────────────────────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ SocialRiskScore      │ Quantitative │ Discrete    │ Measurement regarding the Social Risks each company is specifically exposed to.                                                                                                                                                                                                 │
├──────────────────────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ GovernanceRiskScore  │ Quantitative │ Discrete    │ Measurement regarding the Governance Risks each company is specifically exposed to.                                                                                                                                                                                             │
├──────────────────────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ ControversyLevel     │ Qualitative  │ Categorical │ Measurement regarding the Controversy each company is specifically exposed to.                                                                                                                                                                                                  │
│                      │              │             │  Scores range from 0 (negligible) to 5 (severe)                                                                                                                                                                                                                                 │
├──────────────────────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ LastUpdateDate       │ Quantitative │ Date        │ Date Score Provided to Company                                                                                                                                                                                                                                                  │
├──────────────────────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ MarketCap            │ Quantitative │ Discrete    │ The value of outstanding common shares owned by stockholders of a publicly traded company                                                                                                                                                                                       │
│                      │              │             │  Market capitalization is derived as a product of the market price per common share and the number of common shares outstanding.                                                                                                                                                │
├──────────────────────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Beta                 │ Quantitative │ Discrete    │ A measurement that shows how value moves with regards to a relative measure for an individual asset,commonly an index the stock exists within. Higher than 1 indicates a higher relative movement for the asset compared to its benchmark, lower than 1 indicates the opposite. │
├──────────────────────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ EPS                  │ Quantitative │ Discrete    │ Earnings per share (EPS) indicates the overall profitability of a company on a per share basis.                                                                                                                                                                                 │
│                      │              │             │  The measurement is derived by dividing the profit over the outstanding common stock shares for a company.                                                                                                                                                                      │
│                      │              │             │  The profit may be adjusted to account for less regular transactions that would otherwise skew the figure and not commonly represent the businesses primary activity.                                                                                                           │
│                      │              │             │  A relatively high EPS figure indicates a relatively more profitable company when contrasting against either peers at the same point in time or ones prior results.                                                                                                             │
├──────────────────────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ StDev_1y             │ Quantitative │ Discrete    │ Standard Deviation of daily log Return for 1y                                                                                                                                                                                                                                   │
├──────────────────────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ StDev_3y             │ Quantitative │ Discrete    │ Standard Deviation of daily log Return for 3y                                                                                                                                                                                                                                   │
├──────────────────────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ LnRet_1y             │ Quantitative │ Discrete    │ Sum of daily log returns for 1y                                                                                                                                                                                                                                                 │
├──────────────────────┼──────────────┼─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ LnRet_3y             │ Quantitative │ Discrete    │ Sum of daily log returns for 3y                                                                                                                                                                                                                                                 │
╘══════════════════════╧══════════════╧═════════════╧═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛

*The Global Industry Classification Standard (GICS®) exists as an annually updated industry analysis framework to assist investors domicile business by their key business activities. It has been managed by MSCI and S&P Dow Jones Indices with a goal to define consistent and exhaustive industry definitions. GICS framework consistes of a hierarchical industry classification system with 4 main levels; Sectors, Industry Groups, Industries and Sub-Industries. The scope of this analysis is set to the two highest levels (Sector & Industry). All definitions are standardized and applied to companies globally. Each company is assigned a single GICS® classification in each of the four tiers, according to its principal business activity which in turn is predominately defined by sources of revenue.

**Sustainalytics’ ESG Risk Ratings assess the degree to which a company’s enterprise business value is at risk driven by environmental, social and governance issues. The rating employs a two-dimensional framework that combines an assessment of a company’s exposure to industry-specific material ESG issues with an assessment of how well the company is managing those issues. The final ESG Risk Ratings scores are a measure of unmanaged risk on an absolute scale of 0-100, with a lower score signaling less unmanaged ESG Risk. The ESG Risk Ratings are categorized across five risk levels: negligible (0-10), low (10-20), medium (20-30), high (30-40) and severe (40+).

Reviewing the Quantiative Information¶

Below we observe the breakdown across the quantitative information.

In [ ]:
print(workingData.describe().round(3)) # Print and review quantitative information
       TotalESGScore  EnvironmentRiskScore  SocialRiskScore  \
count        421.000               421.000          421.000   
mean          21.553                 5.649            9.128   
std            7.059                 5.224            3.782   
min            7.000                 0.000            1.100   
25%           16.000                 1.500            6.600   
50%           21.000                 3.700            8.900   
75%           26.000                 8.900           11.600   
max           46.000                23.900           21.000   

       GovernanceRiskScore     MarketCap     Beta      EPS  StDev_1y  \
count              421.000  4.210000e+02  421.000  421.000   421.000   
mean                 6.779  8.065465e+10    1.032    8.160     0.022   
std                  2.221  2.164240e+11    0.403   25.538     0.006   
min                  3.100  8.810063e+06    0.163  -35.220     0.011   
25%                  5.300  1.684141e+10    0.753    2.300     0.017   
50%                  6.200  3.238837e+10    1.032    4.760     0.021   
75%                  7.800  6.585069e+10    1.280    8.460     0.025   
max                 15.500  2.755039e+12    2.490  480.420     0.049   

       StDev_3y  LnRet_1y  LnRet_3y  LogMarketCap  
count   421.000   421.000   421.000       421.000  
mean      0.025    -0.131     0.203        10.555  
std       0.007     0.294     0.388         0.495  
min       0.013    -1.063    -1.830         6.945  
25%       0.020    -0.309     0.003        10.226  
50%       0.023    -0.118     0.225        10.510  
75%       0.028     0.055     0.416        10.819  
max       0.059     0.664     1.485        12.440  

Target Feature¶

The target features of this project will be the Environmental, Social, and Governance ratings of the companies within the Index as additive parts towards a Total score.

Research Objectives¶

Listed companies are increasingly adopting Environmental, Social, and Governance risk measurement in a drive for enhanced transparency surrounding non-financial risk for the benefit of stakeholders and industry at large.

Non-financial risk measurement is conducted by different specialist firms who differentiate their product by employing specific and proprietary scoring mechanisms for commonly listed companies.

The scope of this analysis will be domiciled to the ESG conducted by Sustainalytics (Research Provider) across the S&P 500 Index (Index). Investment performance metrics have been aggregated using built in computational packages utilisting RStudio across the index.

The primary objective here-in is to exlpore how combinations of investment performance and non-financial risk measurement metrics across the Index predict the possible ESG scores for companies by Sector for this specific research provider.

A secondary objective for this investigation is to better understand what predictor values may add value to the accuracy of predictions where we are to extrapolate estimates across a broader universe of listed companies.

Data Exploration and Visualisation¶

In [ ]:
# Load in Data Visualisation packages

import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.colors as mcolors

%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("seaborn")

# Set colour scheme for Controversy Levels
num_steps = len(workingData['ControversyLevel'].unique())
trafficLights = sns.color_palette("RdYlGn", num_steps)[::-1]
Out[ ]:
0.8403193612774451

Single Variable Plots¶

  • Figure 1: By Sector, what is the breakdown of companies we will review? (One Variable 1)
    • Given the competitive nature of this index we expect that over different periods of the economic cycle its proprortionality will move around.
    • We undestand that starting with 501 variables and then dropping to 421 constitutes a retention of ~85% values overall.
    • For this reason we will first look at count of companies grouped by sector with sufficient data for analysis.
    • We note the least represented sectors are Communication Services, Energy, and Materials.
    • We note 5 sectors sit at or over 50 constituents each which provides context to how the data may be skewed where there are 409 datapoints.
    • The most represented sectors in descending order are; Information Technology, Financials, Industrials, Consumer Discretionary, and Healthcare.
In [ ]:
# one-variable plots

# WE HAVE ~400 OF 501 DATAPOINTS FOR THE INDEX, WHAT IS THE BREAKDOWN PER INDUSTRY OF THOSE WITH DATA ?

# p1 - histogram of Industry Freq

p1Data = workingData[['Sector', 'Industry']].groupby(['Sector']).count().reset_index(drop=False)

# Define vectors for chart
p1Sector = p1Data['Sector']
p1IndustryN = p1Data['Industry']

# Plot histogram
plt.figure(figsize = (15, 10))
plt.bar(p1Sector, p1IndustryN)

# Add Hist Labels
plt.xlabel('Sector')
plt.ylabel('Industry Count w/Data')
plt.title('Figure 1: By Sector, what is the breakdown of companies we will review? (n~'+str(sampleSize)+')')


plt.xticks(rotation=90) # Rotating x-axis labels for better readability


plt.show()# Displaying the histogram
  • Figure 2: How much data is missing from each Sector? (One Variable 2)
    • We now look at the data which did not meet the requirements to asses possible sources of survivorship bias.
    • We want to know how much Data is missing per sector.
    • Where all sectors were missing a proportionally similar amount of data we could feel more confident that no one sector was specifically mis-represented.
    • In this instance, however, we note that Communication Services has around two fifths of its constituents to the index missing.
    • This carries implications towards possible learning bias and summary statistical analysis due to the large drop in an already small sample size of 25.
    • Excluding Communication Services We observe a range of 7-21% data missing per sector with Consumer Discretionary and Consumer Staples being the most complete.
In [ ]:
# WE HAVE ~80 OF 501 DATAPOINTS MISSING FOR THE INDEX, WHAT IS THE BREAKDOWN PER INDUSTRY OF THOSE?

# Create group by summaries of the working and starting data grouped by Sector for gap analysis


p2Data = esgData[['Sector','Industry']].groupby(['Sector']).count()
p2Data = p2Data.reset_index(drop=False)
p2Data = p2Data.rename(columns={'Industry': 'IndCountALL'})
p2Data['IndCountComplete'] = p1IndustryN
p2Data['IndCountMissing'] = p2Data['IndCountALL'] - p2Data['IndCountComplete']
p2Data['MissingPct'] = p2Data['IndCountMissing']/p2Data['IndCountALL']
p2Data['MissingPct'] = p2Data['MissingPct'].round(2)

missingN = p2Data['IndCountMissing'].sum()

plt.figure(figsize=(15, 10))
plt.bar(p2Data['Sector'],100*p2Data['MissingPct'])

plt.xlabel('Sector')
plt.ylabel('Pct Missing (%)')
plt.title('Figure 2: How much data is missing from each Sector? (n~'+str(missingN)+")")

plt.xticks(rotation=90) # Rotating x-axis labels for better readability

plt.show()# Displaying the histogram

# We note that this Communications Services may be harder to impute as just over half of their index participants do not have data, and there is a very small sample (n~12) for those who do.

Two Variable Plots¶

  • Figure 3: How are Total ESG Scores distributed by Sector? (Two Variable 1)
    • We note that the total ESG score for each company is the sum of the underlying E,S,G Ratings.
    • By employing a boxplot, we can see how total scores are distributed when grouped by sector.
    • We can start to form hypothesis on how non-financial risk is measured in a sector to sector basis.
    • We acknowledge that the total score is in a range of 0-100 where the larger the score the more risk-present.
    • For a similar pattern of thought that we cannot accurately transpose the performance expectations of an accountant on to a plumber, we too must note here that different sectors will carry some level of autonomous risk based solely on their business activities as a going concern.
    • We note that the standardised nature of these variables allows us to identify outliers across the Information Technology, Consumer Surplus, Materials, and Real Estate Sectors.
    • This may be useful for Investment Managers when undertaking a best-in-class review for screening in the portfolio construction process.
In [ ]:
# p3 - boxplit of total esg score distribution

plt.figure(figsize = (15,8))
sns.boxplot(x=workingData['Sector'],y=workingData['TotalESGScore']);

# Add labels# Add labels
plt.title('Figure 3: How are Total ESG Scores distributed by Sector? (n~'+str(sampleSize)+')', fontsize = 15)
plt.xticks(rotation=90) # Rotating x-axis labels for better readability
# plt.

plt.show();
  • Figure 4: What is the Distribution of Market Capitalisation by Secor? (Two Variable 2)
    • We may employ a transposed box-plot in this instance to better understand the nature of Market Capitalisation outliers by sector.
    • We want to see where the biggest companies sit relative to one another.
    • We also note which sectors inherently have more outliers as an indirect attempt to understand the levels of competition at a sector-specific level.
    • The outliers we note that the 2nd to 6th panels of the chart identify increasingly larger companies which are dominated by Information Technology, Communication Services, and Consumer Discretionary. This is somewhat expected due to their inter-connectivity in the marketing and sales pipeline over a period in which ecommerce was more viable.
    • We anticipate that a higher count of outliers may relate to a more monopolistic manner of market competition.
    • The Materials sector is the sole facet without an outlier which may imply a market competition with higher elasticities and substitutability of marketable product - after all, a shovel or nail can only be so unique for their desired purpose.
In [ ]:
# p4 - rotated boxplot of market cap by sector

plt.figure(figsize = (15,8))

# Transforming the MarketCap column to logarithmic scale
workingData['LogMarketCap'] = np.log10(workingData['MarketCap'])

sns.boxplot(y=workingData['Sector'], x=workingData['LogMarketCap']); # need to find a way to transform MktCap

# Add labels

plt.title('Figure 4: What is the Distribution of Market Capitalisation by Sector? (n~'+str(sampleSize)+')', fontsize = 15)
plt.grid(True)
plt.show();

# Note that while this is skewed by outliers the poinnt remains the same with the grouping and frequency of outliers per sector.
# Materials is the only sector without outliers on the top-side (big fish)

Three Variable Plots¶

  • Figure 5: Are bigger comanies better, badder, or a bit of both? (Three Variable 1)
    • We review Market Capitalisation by Sector by Controversy Level in a strip plot to understand if there is a discernable relationship between size of the company and its controversy.
    • We employ a traffic-light system over 6 ordinal datapoints to visually describe the relationship and review to understand where the more controversial companies sit.
    • We define greater controversy on a scale from 0 to 5 (severe) and let this exist as an anchor for our understanding and expectations prior to further review.
    • A loose understanding of any commerical environment may lead us to expect that companies who operate in a manner that contradicts modern societal values may damage the social license they operate with.
    • We note from this chart that a bunched grouping indicates that, like a choir, most sectors are singing similar songs both in time and on key.
    • We note Financials and Utilities to have more controversial companies whose size is somewhat centrally distributed.
    • Three Communication Services companies (which includes the two largest) have higher controversy which may indicate a systemic risk inherit to the underlying services of the Sectors specific industries.
    • The two biggest companies (both overall and in Information Trchnology) sit with a modest, middle-of-the-park, measurement which may lead us to understand the responsibility their governance committees may shoulder to exercise some mindfulness on their market impact.
    • The Real Estate sector has a majority in the lower risk category which may indicate that either the controversy measurement itself may not capture the underyling industry-specific risk for this sector.
In [ ]:
# p5 - Is Bigger Badder - what is the relationship between Market Capitalisation and Controversy by Sector?

MarketCapMil = workingData['MarketCap']/1000000

plt.rcParams['axes.facecolor'] = 'white'

# Plot Figure

plt.figure(figsize = (15,12))

sns.stripplot(y = MarketCapMil, x = workingData['Sector'], hue = workingData['ControversyLevel'], jitter=True,
    dodge=True, palette = trafficLights, size=15 )


# Add labels
plt.title('Figure 5: Are bigger comanies better, badder, or a bit of both? n~'+str(sampleSize)+')')
plt.legend(loc = 'upper right')
plt.ylabel(" Market Cap $T")
plt.xticks(rotation=90)
plt.show();
  • Figure 6: How is Beta distributed by Sector by Controversy? (Three Variable 2)
    • The boxplots here provide a more immediately digestible understanding of how Beta risk is distributed when faceted by Controversy Level per sector.
    • The Beta a specific company has will be measured as a level of comparable risk to an index in which it may exist. Over 1 denotes more riskier than average, under 1 denotes the opposite.
    • The boxplots help us understand any immediately identifiable relationship between higher investment risk and higher controversy as we anticipate irrational actors as a part of inefficient market mechanics.
    • Visually, the more we see of specific colours, the more we understand wider distribution at that specific Sectors Controversy Level.
    • Where more controversial companies were also higher risk, we would see tight groupings towards the bottom right in each iteration.
    • Where the data is limited, there will be little or no visual aid to determine the faceted distribution. Each Sector has under-represented, or unrepresented controversy levels which is wholly indicative of the sample size limitations but may become more apparent when using wider indicies across a more comprehensive investment universe.
In [ ]:
# p6 - look at the dipserson of Beta risk as facted by controversy level

# Define plot
plt.figure(figsize = (20,15))
sns.boxplot( data = workingData, x = 'Beta', y= 'Sector', hue = 'ControversyLevel', palette = trafficLights)

# Add labels
plt.title('Figure 6: How is Beta distributed by Sector by Controversy? n~'+str(sampleSize)+')', fontsize = 15);
plt.legend(loc = 'upper right')

plt.show();

Summary and Conclusion¶

The objective at hand is to predict the target features of Envifonmental, Social, and Govnernance Rankings.

We are graced with data whose strucural integrity requires limited wrangling. The data are indexed with a Primary Key (Stock Symbol) and afford a scalable structural integrity should additional providers be incorporated for added breadth.

The data we have on the Index allows technicians to manage their expectations with a reasonable effectiveness at the start of their analysis with roughly 15% of the initial scope screened out.

We may be mindful of Sectors whose sample sizes are quite low as we seek further quantiative interrogration.

In order to better understand correlation from causeation in the calibration of our model we will proceed to review the inter-sector commonalities of both Financial and Non-Financial risk measurements in Phase 2.

At a high level we now know that there is still more to discover when considering the relationship between financial and non-financial risk measurements.

References¶

Aroussi, R. (n.d.). yfinance: Yahoo! Finance market data downloader. [online] PyPI. Available at: https://pypi.org/project/yfinance/.

MSCI (2023). GICS - Global Industry Classification Standard. [online] www.msci.com. Available at: https://www.msci.com/our-solutions/indexes/gics.

Rudden, K. (2019). Scraping ESG Scores. [online] www.kylerudden.com. Available at: https://www.kylerudden.com/blog/scraping-esg-scores/ [Accessed 9 May 2023].

Sullivan, K., Bujno, M., Sullivan, K. and Bujno, M. (2023). Emerging trends in ESG governance for 2023. [online] The Harvard Law School Forum on Corporate Governance. Available at: https://corpgov.law.harvard.edu/2023/03/26/emerging-trends-in-esg-governance-for-2023/ [Accessed 10 Apr. 2023].

Sustainalytics (2021). ESG Risk Ratings - Methodology Abstract Version 2.1. [online] Available at: https://www.sustainalytics.com/docs/knowledgehublibraries/default-document-library/sustainalytics_esg-risk-rating_methodology-abstract-(2).pdf?sfvrsn=15b2cba1_0 [Accessed 18 Apr. 2023].

Sustainalytics (n.d.). Overview of Sustainalytics’ ESG Risk Ratings. [online] Available at: https://connect.sustainalytics.com/hubfs/SFS/Sustainalytics%20ESG%20Risk%20Ratings_Issuer%20Backgrounder.pdf.

sustainalytics.com. (n.d.). ESG Risk Ratings. [online] Available at: https://www.sustainalytics.com/esg-data.

www.spglobal.com. (2023). S&P 500® - S&P Dow Jones Indices. [online] Available at: https://www.spglobal.com/spdji/en/indices/equity/sp-500/#data.

Appendix A: Pairwise plot of all variables¶

In [ ]:
sns.pairplot(workingData) # 
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x1861a6aad10>